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Abstract 

Background: Dealing with high dimensional markers, such as gene expression data obtained using microarray 
chip technology or genomics studies, is a key challenge because the numbers of features greatly exceeds the 
number of biological samples. After selecting biologically relevant genes, how to summarize the expression of 
selected genes and then further build predicted model is an important issue in medical applications. One intuitive 
method of addressing this challenge assigns different weights to different features, subsequently combining this 
information into a single score, named the compound covariate. Investigators commonly employ this score to 
assess whether an association exists between the compound covariate and clinical outcomes adjusted for baseline 
covariates. However, we found that some clinical papers concerned with such analysis report bias p-values based 
on flawed compound covariate in their training data set. 

Results: We correct this flaw in the analysis and we also propose treating the compound score as a random 
covariate, to achieve more appropriate results and significantly improve study power for survival outcomes. With 
this proposed method, we thoroughly assess the performance of two commonly used estimated gene weights 
through simulation studies. When the sample size is 100, and censoring rates are 50%, 30%, and 10%, power is 
increased by 10.6%, 3.5%, and 0.4%, respectively, by treating the compound score as a random covariate rather 
than a fixed covariate. Finally, we assess our proposed method using two publicly available microarray data sets. 

Conclusion: In this article, we correct this flaw in the analysis and the propose method, treating the compound 
score as a random covariate, can achieve more appropriate results and improve study power for survival outcomes. 



Introduction 

High-dimensional omics data 

Personalized medicine is expected to enable a more pre- 
dictive discipline, in which therapies are targeted toward 
the molecular constitution of individual patients and 
their disease; thus, molecular biomarkers are widely 
expected to revolutionize the current practice of medi- 
cine. For example, the progress of genomics has made it 
possible to evaluate molecular signatures to predict can- 
cer metastasis [1,2]. Various technological breakthroughs 
have led to a plethora of high-dimensional omics data to 
support personalized medicine, and these data have a 
common characteristic: the numbers of features greatly 
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exceeds the number of biological samples. Because biolo- 
gical phenomena are the result of sets of features (e.g., 
concerted expression of multiple genes), the analysis of a 
group of related features (e.g., genes) may be more effec- 
tive and may provide more directly interpretable results 
than the analysis of individual genes. 

As high-dimensional omics research has advanced, the 
compound covariate (or compound score) has generally 
been held as a simpler and more straightforward 
approach. After selecting biologically relevent genes in 
training cohort, such a score is often a useful device in 
medical applications to define the information contained 
in a single set of data and to summarize the association 
of a set of variables with disease. Tukey [3] first advo- 
cated the use of compound covariates in the clinical trial 
setting. To develop a compound score, the individual 
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covariates are summed; the association between such a 
compound covariate and outcome then is evaluated via 
regression analysis. Tomasson [4] used a compound 
score for binary outcomes, via fitting a logistic regression. 
Later, Hedenfalk [5] successfully applied the compound 
covariate method to class prediction analysis for breast 
cancer data. Because the use of the compound covariate 
is intuitive and seems useful, many other leading 
researchers also have applied this method for analyzing 
omics data sets [6-9]. 

Problem statements 

A compound covariate is a linear combination of the 
basic covariates being studied, with each covariate having 
its own coefficient or weight. For survival outcomes, a 
commonly used scheme is to 1) compute the univariate 
Cox regression [10] for each gene of interest, 2) assign a 
weight to each gene (typically, the estimated regression 
coefficients or Wald statistics from the univariate Cox 
regressions), and 3) combine the weighted genes in a lin- 
ear model that incorporates gene expression levels in 
each sample. This method of modeling weighted genes is 
believed to reflect the importance of each individual gene 
to the outcome; the higher the weight assigned, the more 
significant a particular gene is. 

However, the linear combination of the group of genes, 
with each gene having its own estimated weight, should 
not be treated as an observed covariate or fixed covariate. 
Because selected weights are estimated through computing 
the univariate Cox regression of each individual gene, the 
compound "covariate" should be treated as a random cov- 
ariate that includes estimated error. Besides, for the pur- 
pose of assessing whether an association exists between 
the compound covariate and survival outcomes, Cox 
regression is typically used to evaluate the significance 
level of the parameter of the compound score. However, 
bias concerns arise when the same data set, training 
cohort, is used for a double purpose: to construct the 
compound covariate and then to test it. This framework 
results in an over-fitting problem. As shown in Figure 1, 
we simulate 50 observations with 3, 5, 10, and 30 non-cau- 
sal genes used to create a compound covariate. The Cox 
regression model is then used to test whether an associa- 
tion exists between the compound scores and survival out- 
comes in the same dataset. The distribution of p-values 
should be uniform in the interval [0, 1]. In our simulation, 
however, p-values are concentrated around zero, especially 
as the number of genes increases. This demonstrates that 
type I error rates are inflated and consequently not con- 
trolled. Obviously, double using the training cohort casue 
overfitting problem and bias p-values arised. We found 
that some medical papers report inappropriate p-values 
for testing training cohort data [11,6]. Although the pro- 
posed bias p-values in their training cohort do not affact 



their final research results inferred from another indepen- 
dent testing data set, these observations motive us to 
study a more appropriate testing procedure for the com- 
pound covariate if the investagtors whish to report a test- 
ing result in the training cohort. 

In this paper, we first contend the compound covariate 
should be treated as a random observation. Our idea is 
based on that proposed by Prentice [12], who analyzed 
covariates with measurement error and used a partial like- 
lihood function technique to infer whether the parameter 
for the covariate was significant. In addition, if a training 
data set is used for a double purpose (i.e., to construct the 
compound covariate and then to test it), the resulting 
over-fitting means the p-value is not reliable when testing 
the regression parameter. Therefore, we use a 2-fold 
method (e.g., [13,14]), splitting all observations in the 
training cohort into two parts, one part for assigning gene 
weights, and another part for testing the regression para- 
meter through a partial likelihood score test. The remain- 
der of this paper is organized as follows: We outline 
creation of the compound score using a random covariate 
approach. Then, we investigate the accuracy of the asymp- 
totic distribution of the proposed tests. We thoroughly 
assess the performance of two commonly used estimated 
weights, "estimated coefficient" and "Wald statistic", for 
the Cox proportional hazards model. Finally, we illustrate 
the implementation of the proposed methods through two 
real data sets, and offer concluding remarks. 

The proposed method 

The compound covariate 

In this section, we formally define some notations for 
compound covariate and introduce a procedure to identify 
whether a set of genes is truly associated with survival 
times in the training cohort. Let 7} denote the survival 
time and C, denote the censoring time independent of 7} 
for the ;'th patient, with /' = 1, 2, n. When some observa- 
tions are right-censored, one can observe only random 
variables Xj = min(7}, C ; ) and Sj = I(Tj < Cj), where I{A) is 
an indicator function of event A, assuming the value 1 if 
event A occurs and the value 0 otherwise. Let Xj = (xji, Xj 2 , 
Xj p ) be the standardized selected gene's intensity in the 
/'th patient, where p is the size of the gene set. For creating 
the compound covariate, one first fits the univariate Cox 
regression model for each individual gene, that is, 

h (t\xk) = h ok (t) exp (XkPk) , k = 1, 2 p (1) 

where h ok (t) is a baseline hazard function of each gene 
k, and fi^ is a corresponding parameter to be estimated. 

Let jSi, (62, ■ ■ ■ , Pp be the estimators of Pi, /3 2 : —, P p , and 
W\, u>2, ■ ■ ■ ,Wp be the Wald statistics obtained from (1). 
A compound covariate commonly used by clinicians is 
defined as 
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Figure 1 P-value distribution under the null hypothesis with nominal level 0.05. 
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or another possible weighting policy depends on Wald 
statistics, 



for each patient /, /' = 1, 2, n. From the perspective 
of biology, the weighting policy is believed to reflect the 



importance of each individual gene to survival outcome, 
the higher the weight, the more important the gene is. 
In other words, the score can be regarded as a con- 
densed index, representing the collective effects of gene 
expression. 

To identify whether this gene expression pattern is 
truly associated with survival in training cohort, investi- 
gators prefer to use Cox regression analysis. That is, after 
fitting model (1), they construct 

h(t\z) = h 0 (t) exp (yqz) 
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where h 0 (t) is a baseline hazard function and y 0 is a cor- 
responding parameter for the compound covariate z; they 
then use the same data set to test the null hypothesis H 0 : 
Yo = 0. Under the null hypothesis, however, the method 
results in uncontrolled type I error, because the training 
data set has been used twice, both for building the model 
and for testing the regression parameter. If independent 
data are available, carry f} lt f} 2 , . . . , ftp from training data 
set and test on another independent data set is possible, 
allowing unbiased model validation to prevent over-fitting. 
However, if investagtors whish to report a testing result in 
the training cohort, an alternative, though less optimal, 
study design is using k-fold method or split the training 
cohort data randomly (2-fold), with 50% of the data being 
assigned to develop the score, compound covariate, and 
50% to evaluate its performance. The limitation of this 
approach is that it requires a relatively large sample size. 
With this method, Kaplan-Meier survival curves [15] for 
the two sets should be examined to ensure no significant 
difference by the random selection of those two sets from 
training cohort data. 

Cox regression with a random compound covariate 

The measuring mechanism makes the compound covari- 
ate an estimation and not a fixed observable. Naturally, 
such a covariate should be treated as a random covariate, 
and the variance of each score needs to be taken into 
account. To fit a Cox regression model with a random 
covariate, we use the idea advocated by Prentice, which 
presents the Cox model as a multiplicative hazards 
model, with a relative risk at time t, 



E [exp (y 0 z)] . 



(2) 



This is a weighted average of a possible relative risk 
given the covariate z. The Cox regression model then can 
be written as 

h (t\z) = h 0 (t) E [exp (y 0 z)] . 

Because omics data sets involve a large number of fea- 
tures, we assume the distributions of the scores follow 
normal distribution. That is, for each patient /', assuming 
Zj is drawn from a normal distribution with mean and 
variance of, (2) can be derived as 



exp \ yofij + -of K 0 2 



Thus, a quadratic term, of ]/q/2, is added to the rela- 
tive risk function, accounting for the variance in the 
random covariate. In addition, with both random covari- 
ates z (in this case, the compound covariate) and 
observed covariates w with dimension d (in this case, 



the clinical variable), it is easy to incorporate the fixed 
covariate effects into the Cox model, as: 

h (t\z, w) = h 0 (t) E [exp (y 0 z + y\w)] , 

where y 0 is the parameter for the compound covariate, 
Yi is the corresponding parameters for fixed observa- 
tions and y I is the transpose of Y i. 

A partial likelihood function and score test 

Suppose now that t 1 < ... <ti are the ordered distinct sur- 
vival times in the sample, and let R{t,) and F(ti) denote 
the risk set prior to t t and the set of subjects failing at t it 
respectively. The partial likelihood function is: 



L(Y) = U 



1 EjfeRft) E [^P (KOZ + Y\^)T ' 



where m t is the number of failures at time t t {i = 1, %..X). 
Let a = E [exp [y Q z + yJ w }]> b = da/dy and c = db/dy 
where y = [y 0 , j/T] T (The explicit forms of a, b and c are 

shown in Additional file 1). The score statistic then can be 
derived as 



_ 91ogL(y) ^ £^ 



dy 



JL _ mt ^^ 1 x (3) 



with the observed information matrix V 



3 2 logL(y) Ej S R(i,) c y v ^ ( T,jeR(td b >) 



dy 1 



j=1 J2jsR(t,) a ii 



EE \1 



.=1 jeFft) \ ai > a \ 



(4) 



Consequently, under the null hypothesis Ho : y = 0, 
the partial likelihood score test v T V' 1 v will have an 
asymptotic xj+i distribution when V is nonsingular. 
In addition, (3) can be used in a standard manner for 
^estimation. In practice, we can calculate a p-value by 
replacing fij with Zj and of with Var(z y ) where 

P . . P 
Var( Zj ) = J2 Vai \ x ikPk) = 

k=i k=\ 

is derived based on approximation and s^ k is the var- 
iance estimation of f} k by using estimated coefficient as 
weight or 

P P 

Var ( z i) = E Var ( x )kU>ksign (ft)) = 3 E x fk 

k=i k=\ 
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by using Wald statistics as weight. We show the deri- 
vation in more detail in Additional file 2. 

Multiple gene sets 

Further, we extended the compound covariate to 
multiple gene sets. If there are q given independent 

gene sets for the /th patient, x-^xj^, . . . ,Xj , where 

x* s) = x^j, . . . , s = 1, 2, q , the compound 
covariates can be written as vector 

>» = (4M 2) f) - (e^&E^a E$a) -or 

/ h Pi P, \ 

z * = ( E x< jk <bhSi z D (a) ■ E 4 2) " ,,iSi s n (a) E 4fe" ,, ' si g n (a) J 

where is the number of genes of the sth gene set. 
Then, the partial likelihood function can be written as 

1( , A UieF {ti) E [ ex p(y T o z+ y T i w )] 
<=i Ei 6 ji( <1 )4 e ^ , ()'o z+ )'i w )J 

Let a = E [exp (yjz + yjw)], b = 3a/3y and c = db/dy, 

where y = [y J, j/|] T . The score statistic and the observed 

information matrix can be further derived as (3) and (4) as 
well. Consequently, under the null hypothesis Ho : y = 0, 
the partial likelihood score test v T V 1 v has an asymptotic 

xJ +(? distribution when Vis nonsingular. If we reject the 
null hypothesis, we can conclude that the covariate vector 
is associated with survival time. 

Simulation results 

To assess the performance of the proposed testing proce- 
dure for compound covariate, we conducted simulation 
studies under various scenarios to study type I error rate 
and power. For the scenario of split training data set as 
two parts and the consideration of compound scores as 
random covariates, we denoted the compound score using 
/§ as a weight function as SRCb, and the compound score 
using the Wald statistic as a weight function as SRC W . 
The corresponding notations, SC B and SC W , refer to split 
data but without treating the compound covariate as a 
random covariate (i.e., typical Cox regression [10]). The 
notations DC B and DC W refer to compound scores with 
double use of the training data set, both for building the 
model and then testing for the same data set. Assume 
there are p selected genes in a gene set. Gene expression 
data were generated from a multivariate normal distribu- 
tion with mean vector /J = \fi lt j3 2 , •••> P P ] T > and variance- 
covariance matrix equal to the identity matrix for n cases. 
Generated survival times were associated with gene 



expression via a proportional hazard model, exp(# T /J). 
All tests with nominal significance level 0.05 were applied 
and empirical rejection probability was obtained based on 
2000 simulation runs. 

For comparing empirical type I error rates, the value of 
P was set to 0. The total sample size was set to 50, 75, or 
100. After gene selection process, we assume the total 
number of disease relative genes was set to 10, 30, 50, or 
70. Censoring times (denoted as cen.) were generated 
from an exponential distribution, and the overall censor- 
ing fraction in either setup was fixed at 10% or 40%. 
Table 1 shows the empirical type I error rates. As shown, 
the proposed procedure for SRC B and SRC W preserve 
reasonable type I error rates. The methods DC B and 
DC W , however, which make double use of the data set, 
do not control type I error. Note that we do not show 
type I error for SC B and SC W methods, because these 
methods are typical Cox regression. 

Because type I error rates are preserved for both the 
SRC B /SRC W and SC B ISC W methods, we compared the 
power under each method in Table 2. In this simulation, 
censoring times were also generated from an exponential 
distribution, and the overall censoring fraction in either 



Table 1 Empirical type I error rates 



Method 


n 


cen. 


The total num 


ber of genes 








10 


30 


50 


70 


SRC B 


50 


10% 


0.052 


0.057 


0.051 


0.048 






40% 


0.041 


0.047 


0.045 


0.046 




75 


10% 


0.052 


0.048 


0.045 


0.046 






40% 


0.044 


0.046 


0.050 


0.046 




100 


10% 


0.056 


0.049 


0.052 


0.052 






40% 


0.045 


0.044 


0.048 


0.050 


SRC W 


50 


10% 


0.058 


0.046 


0.052 


0.050 






40% 


0.034 


0.046 


0.036 


0.043 




75 


10% 


0.046 


0.042 


0.051 


0.051 






40% 


0.044 


0.038 


0.044 


0.040 




100 


10% 


0.051 


0.046 


0.048 


0.060 






40% 


0.044 


0.041 


0.046 


0.048 


DC B 


50 


10% 


0.937 


1.000 


1.000 


1.000 






40% 


0.910 


1.000 


1.000 


1.000 




75 


10% 


0.944 


1.000 


1.000 


1.000 






40% 


0.946 


1.000 


1.000 


1.000 




100 


10% 


0.957 


1.000 


1.000 


1.000 






40% 


0.952 


1.000 


1.000 


1.000 


DC W 


50 


10% 


0.926 


1.000 


1.000 


1.000 






40% 


0.916 


1.000 


1.000 


1.000 




75 


10% 


0.920 


1.000 


1.000 


1.000 






40% 


0.936 


1.000 


1.000 


1.000 




100 


10% 


0.929 


1.000 


1.000 


1.000 






40% 


0.933 


1.000 


1.000 


1.000 



Empirical type I error rates for comparing SRC a SRC W DC B and DC W methods. 
The value of /S was set to 0. 
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Table 2 Power comparison under two different scenario 



n 


cen. 




Scenarios 1 






Scenarios 2 








SRC B 


SC B 


SRC W 


sc w 


SRC B 


sc B 


SRC W 


SC W 


Strong effect: P = [fl h p 2 , ... 


, P30J — 


[1,1, 


1] T 








50 


10% 


0.757 


0.742 


0.675 


0.650 


0.600 


0.599 


0.723 


0.692 




30% 


0.624 


0.580 


0.536 


0.490 


0.480 


0.422 


0.546 


0.505 




50% 


0.448 


0.350 


0.381 


0.312 


0.350 


0.250 


0.359 


0.294 


75 


10% 


0.960 


0.956 


0.907 


0.902 


0.876 


0.870 


0.944 


0.942 




30% 


0.883 


0.864 


0.828 


0.766 


0.783 


0.771 


0.875 


0.822 




50% 


0.758 


0.626 


0.690 


0.526 


0.607 


0.494 


0.694 


0.580 


100 


10% 


0.998 


0.997 


0.985 


0.982 


0.974 


0.973 


0.996 


0.995 




30% 


0.982 


0.974 


0.948 


0.917 


0.940 


0.905 


0.966 


0.955 




50% 


0.928 


0.846 


0.868 


0.730 


0.806 


0.695 


0.883 


0.802 


Low effect: j6 


= 


Pi, /3 30 ] T = [0.5, 0.5, ... 


, 0.5] T 








50 


10% 


0.666 


0.625 


0.594 


0.576 


U.zoo 


0.242 


0.326 


U.31D 




30% 


0.498 


0.487 


0.440 


0.430 


0.206 


0.165 


0.224 


0.214 




50% 


0.362 


0.285 


0.328 


0.244 


0.144 


0.122 


0.162 


0.124 


75 


10% 


0.930 


0.924 


0.859 


0.850 


0.492 


0.466 


0.574 


0.570 




30% 


0.824 


0.756 


0.756 


0.688 


0.370 


0.325 


0.432 


0.421 




50% 


0.642 


0.553 


0.571 


0.469 


0.263 


0.206 


0.312 


0.224 


100 


10% 


0.992 


0.990 


0.964 


0.950 


0.662 


0.654 


0.796 


0.792 




30% 


0.959 


0.944 


0.918 


0.850 


0.558 


0.505 


0.652 


0.594 




50% 


0.852 


0.760 


0.802 


0.654 


0.412 


0.319 


0.472 


0.370 



We compared the power under each method SRC B ISRC W and SCg/SCw. 
The first scenario considers 30 disease related genes. The second scenario 
considers 30 disease related genes, with the other 27 genes considered 
no effect. 



setup was fixed at 10%, 30%, or 50%. We then simulated 
30 total genes in one gene set under two different scenar- 
ios. The first scenario considers 30 disease related genes. 
We designed two different levels of effect, strong effect 
and low effect, as 

P = IPi, Pi, ■ ■ ■ , Pio? = [1,1,. ..,1] T , 

P = [fiu y8 30 ] T = [0.5, 0.5, . . . , 0.5] T , 

respectively. The second scenario considers 3 disease 
related genes, with the other 27 genes considered "noise" 
(i.e., no effect). Strong effect and low effect in this case 
were set as 

P = IPi, Pi, ■ ■ ■ , Pio? = [1,1,1,0...,0] T , 

P = IPu Pi,-.-, Psof = [0.5, 0.5, 0.5, . . . , Of. 
Results are shown in Table 2. 

For scenario 1, all 30 genes have effects. As expected, the 
power of the tests increases with increase in total sample 
size and gene effect, but decreases as the censoring pro- 
portion grows. Under the first scenario, the power of 
the SRC B method is always better than that of SC B , and 
SRC W is always better than SC W . This result indicates that 



treating the compound score as a random covariate yields 
higher power than treating the score as a fixed covariate. 
When the sample size is 100, the power average increases 
10.6, 3.5, and 0.4 percentage points for 50%, 30%, and 10% 
censoring, respectively. This is a reasonable result, because 
the random covariate approach involves fitting a quadratic 

Cox regression model, exp ^yoMj + °j 2 Ko/2^ , instead of 

exp(y n , ft,)- The quadratic form takes into account the var- 
iance of each score, use of the compound score without 
acknowledgement of covariate error yields lower power. 

To further illustrate the effect of treating the compound 
score as a random covariate, in Figure 2, we show the 
power curves of the SRC® SCb, SRC w , and SC W methods 
for gene effect varying from -1 to 1. The total sample size 
was set to 100, and the censoring fraction was fixed at 
50%. As shown in Figure 2, difference in power between 
SRC B and SC B and between SRC W and SC W increases as 
gene effect size increases, when gene effect size increases, 
the quadratic term can more accurately account for var- 
iance in the effect. Thus, treating the compound score as a 
random covariate in the Cox regression model provides 
greater power. 

For the first scenario, tests based on SRC B have higher 
power than those based on SRC W . In the second scenario, 
however, SRC W yields higher power than SRC B . This pat- 
tern change seems to relate to the increase in noise in the 
gene set. For Figure 3, we fixed the total number of genes 
at 30, sample size, 50; censoring fraction, 10% in one gene 
set, and show the power curves as gene effect and number 
of noise genes increase. There are eight lines in Figure 3. 
All solid lines indicate power curves for SRC B while all 
long-dash lines indicate power curves for SRCw Situation 
a has 30 disease related genes, b has 10 disease related 
genes and 20 noise genes; c has 5 disease related genes 
and 25 noise genes, and d has 3 disease related genes and 
27 noise genes. As gene effect increases, all powers 
increase. As the number of noise genes increases (from 
a to d), however, the powers of SRC B and SRC W decrease. 
With no noise genes (case a), the power of SRC B is always 
greater than that of SRC W . As the number of noise genes 
increases, however, the power of SRC W gradually improves 
over that of SRC B . This results from the fact that the 
compound covariate SRCw takes into account the variance 
of p . Similar results were obtained with larger sample size 
(75, 100) and censoring fraction (30%, 50%). Consequently, 
if prior biological knowledge indicates many noise genes 
in a given gene set/pathway, we recommend use of the 
compound covariate SRC W over SRC B . 

Figure 4 shows the power curve for SRC B with different 
sample sizes and different numbers of disease related 
genes. In this simulation design, the censoring fraction 
was fixed at 30%. The effect of all disease related genes 
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Figure 3 Power curves with varying gene effect and number of noise genes (sample size, 50, censoring fraction, 10%). 
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was set to 0.5. As expected, the power increases as sam- 
ple size grows. Power decreases, however, as the number 
of disease genes increase. Under this specified setting, if 
we need 80% power and have sample size 100, we can 
include about 90 disease related genes in this analysis. If 
we have sample size 60, however, we can only include 
fewer than 30 disease related genes. This result indicates 
the need for greater sample size to preserve power when 
a a gene set includes a large number of disease genes. 

Examples 

In this section, we demonstrate our methodology using 
two examples, an Amsterdam 70-gene breast cancer gene 
signature [1] and a data set involving two pathways for 
non-small-cell lung cancer. All tests with nominal level 
0.05 were applied to the training cohort. The R code for 
obtaining p-values for the proposed testing procedure is 
available from the authors upon request. 

Breast cancer data set 

The well-known Amsterdam 70-gene breast cancer gene 
signature was published by Van't Veer [1]. To evaluate 
the previously established 70-gene prognosis file, Van de 
Vijver [16] further classified an additional 295 consecutive 
patients with stage I or II breast cancer to validate the 
breast cancer gene signature. Because the 295 patients are 
independent of the original data, we re-analyzed this data 
set using our methodology. In this data set, patients were 



followed for a median of 7.2 years, with 79 observed 
deaths. The survival curve is shown in Figure 5 (a) and the 
testing results, including estimated coefficients (Coef.), 
relative risk (RR), and p-values are given in Table 3. 

Although all coefficients and relative risks are very close, 
the p-values are very different. When using DC B and 
DC W , the p-values are 8.6 x 10~ 13 and 1.1 x 10" 13 , respec- 
tively. When treating the compound covariate as fixed, the 
p-values of SC B and SC W are 1.1 x 10' 7 and 1.3 x 10" 7 . 
When using our procedure, the p-values of SRC B and 
SRC w zee 1.9 x 10~ 8 and 1.8 x 10~ 8 . Although the results 
remain significant regardless of method, we achieve appro- 
priate p-values for the training cohort, showing that the 
70-gene prognosis signature can be used to evaluate early 
events in breast cancer patients. We get consistent conclu- 
sion with the other researches [17,16]. 

Non-small cell lung cancer data set 

We also tested our method by applying it to a publicly 
available non-small-cell microarray data set downloaded 
from National Center for Biotechnology Information 
Gene Expression Omnibus (GSE14814). There are 90 
gene expression profiling conducted on mRNA isolated 
from frozen tumor samples. In this example, two well- 
known cancer-related pathways were used to test associa- 
tion with survival outcomes for demonstration purposes. 
The first signaling pathway is the p53 pathway, which is 
induced by a number of stress signals, including DNA 
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(a) Breast cancer data set 



(b) Non-small-cell Lung Cancer data set 
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Figure 5 Kaplan-Meier curves for two data sets 



damage, oxidative stress, and activated ontogenesis. The 
other pathway is the NOD-like receptor signaling path- 
way, which has been associated with an increased risk for 
the development of different types of cancer [18]. There 
are 61 genes in p53 pathway and 54 genes in NOD-like 
receptor signaling pathway, respectively. The median 
follow-up time of these patients was 5.4 years, and the 
number of observed deaths was 29. Figure 5 (b) shows the 
survival curve, and the test results are given in Table 4. 

To summarize all the information, two compound cov- 
ariates were used. As shown, conventional Cox regression 
yields overall p-values that are strongly statistically signifi- 
cant (2.29 x 10' 6 for DC B and 1.85 x 10' 5 for DC W ). When 
treating the compound score as a fixed covariate and 
using a split data set, however, the p-values of SC B and 
SC W become 0.432 for both. When treating the compound 
score as a random covariate, the p-values of SRCb and 
SRC W become 0.236 and 0.358, respectively. Such 



divergent p-values suggest that an inappropriate method 
may well lead to misleading results. 

Concluding remarks 

In this paper, we focused on survival outcomes and 
proposed a feasible and correct method for testing the 
compound covariate to evaluate its association with sur- 
vival outcomes for training cohort data. We have 
described the use of a random covariate, SRC B ISRC W , to 
achieve correct testing results for training cohort data 
and moderately improve power as compared to the use 

Table 4 Non-small-cell lung cancer data set analysis 



Table 3 Breast cancer data set analysis 



Method 



Coef 



RR 



p-value 



SRC B 


0.052 


1.12 


1.9 X 10~ 8 


SRC\/v 


0.022 


1.12 


1.8 X 10~ s 


SQ 


0.093 


1.10 


1.1 x 10~ 7 


sc w 


0.040 


1.04 


1.3 x 10~ 7 


DC B 


0.078 


1.08 


8.6 X 10~ 13 


DC W 


0.015 


1.02 


1.1 x 10" 13 



Method 


Pathway 


Coef 


RR 


p-value 


Overall p-value 


SRC B 


NOD 


0.033 


1.0013 


0.59 


0.236 




P53 


0.037 


1.0044 


0.67 




SRC W 


NOD 


0.016 


1 .0063 


0.37 


0.358 




P53 


0.001 


1.0002 


0.99 




sc B 


NOD 


0.077 


1.08 


0.36 


0.432 




P53 


0.015 


1.01 


0.90 




SC W 


NOD 


0.034 


1.03 


0.24 


0.432 




P53 


-0.01 


0.99 


0.74 




DC B 


NOD 


0.072 


1.07 


0.37 


2.29 X 1 Q 6 




P53 


0.314 


1.37 


0.003 




DC W 


NOD 

DC 2 


0.019 
0.055 


1.02 
1.06 


0.21 


1 .85 X 1 0~ 5 



To evaluate the established 70 breast cancer gene signature published by 
Van't Veer with ther proposed method. 



To evaluate the 90 gene expression profiling from National Center for 
Biotechnology Information Gene Expression Omnibus (GSE14814). The first 
signaling pathway is the p53 pathway. The other pathway is the NOD-like 
receptor signaling pathway. 
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of SC B /SC W . Simulation study shows that our proposed 
method performs consistently better than SC B ISC W , 
because the quadratic term utilized in the SRC B ISRC W 
method takes into account error in the compound cov- 
ariate. We further found that an increase in sample size 
improves power when there is a high proportion of cen- 
sored data. If the gene set of interest includes noise 
genes, we suggest that the compound covariate SRC W is a 
better choice than SRC B ; whether noise genes or non- 
functional^ related genes are hidden in gene set is a 
judgment call for a geneticist. In addition, we contend 
that a flaw of biomedical papers concerned with such 
topics report an bias p-value based on flawed compound 
covariate analysis for the same training data set. In this 
paper, we use a well-known 2-fold concept, with one part 
of the data to built compound covariate and the remain- 
der part for testing if there is association between survival 
outcomes and the score to ensure correct p-values in the 
training data set. Note that we need to check the propor- 
tional hazards assumption. 

Our method can simultaneously test for more than one 
gene set in a training cohort data. More generally, this 
procedure can be applied not only for survival outcomes, 
but also for binary or continuous outcomes. The weighted 
flexible compound covariate method WFCCM [19], an 
extension of the compound covariate, also allows for use 
of the method of statistical analysis presented here. In 
addition, this method can easily be extended to consider 
the interaction between random covariates and clinical 
observed covariates, as 

h (t\z, w) = h Q (t) E [exp (y 0 z + y]w + y\ (zw))] 

using the same analysis procedure. The chosen weight 
P or u> can be adjusted by the other clinical observed 
covariates in the proposed framework. Our method, how- 
ever, cannot be used to test the interaction between two 
random covariates, because of the complexity of specifying 
the distribution for the interaction between two random 
covariates; this is an area worthy of further investigation. 
Another one potential practical concern of the proposed 
method is that sample size must not be too small, higher 
fractions of censored data create the need for further 
increased sample size. Similarly, to achieve high power 
when studying a large number of genes, greater sample 
size is needed. When studying a large number of genes, 
ignoring the covariance that exists between genes does not 
influence the type I error rate, however, taking the covar- 
iance into account may increase power. Further research 
is required to address these limitations. Note that the per- 
mutation test (e.g., [20] ) might be another method to cal- 
culate an appropriate p-value for the training dataset, 
however, with the permutation test, weights are not easily 



adjusted by the other covariates. Even for a small gene set, 
this approach may appear too expensive in computer time. 

Additional material 



Additional file 1: The explicit forms of a, b and c Additional file 1 is a 
PDF file which shows the explicit forms of a, b and c. Then, the score 
statistic can be derived. 

Additional file 2: The derivation of variance for compound 
covariates. Additional file 2 is a PDF file which shows the derivation of 
variance for compound covariates. 
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